intent recognition
Transformer-Driven Triple Fusion Framework for Enhanced Multimodal Author Intent Classification in Low-Resource Bangla
Islam, Ariful, Mahmud, Tanvir, Hossen, Md Rifat
The expansion of the Internet and social networks has led to an explosion of user-generated content. Author intent understanding plays a crucial role in interpreting social media content. This paper addresses author intent classification in Bangla social media posts by leveraging both textual and visual data. Recognizing limitations in previous unimodal approaches, we systematically benchmark transformer-based language models (mBERT, DistilBERT, XLM-RoBERTa) and vision architectures (ViT, Swin, SwiftFormer, ResNet, DenseNet, MobileNet), utilizing the Uddessho dataset of 3,048 posts spanning six practical intent categories. We introduce a novel intermediate fusion strategy that significantly outperforms early and late fusion on this task. Experimental results show that intermediate fusion, particularly with mBERT and Swin Transformer, achieves 84.11% macro-F1 score, establishing a new state-of-the-art with an 8.4 percentage-point improvement over prior Bangla multimodal approaches. Our analysis demonstrates that integrating visual context substantially enhances intent classification. Cross-modal feature integration at intermediate levels provides optimal balance between modality-specific representation and cross-modal learning. This research establishes new benchmarks and methodological standards for Bangla and other low-resource languages. We call our proposed framework BangACMM (Bangla Author Content MultiModal).
DialogGraph-LLM: Graph-Informed LLMs for End-to-End Audio Dialogue Intent Recognition
Liu, HongYu, Li, Junxin, Guo, Changxi, Chen, Hao, Huang, Yaqian, Guo, Yifu, Yang, Huan, Cai, Lihua
Recognizing speaker intent in long audio dialogues among speakers has a wide range of applications, but is a non-trivial AI task due to complex inter-dependencies in speaker utterances and scarce annotated data. To address these challenges, an end-to-end framework, namely DialogGraph-LLM, is proposed in the current work. DialogGraph-LLM combines a novel Multi-Relational Dialogue Attention Network (MR-DAN) architecture with multimodal foundation models (e.g., Qwen2.5-Omni-7B) for direct acoustic-to-intent inference. An adaptive semi-supervised learning strategy is designed using LLM with a confidence-aware pseudo-label generation mechanism based on dual-threshold filtering using both global and class confidences, and an entropy-based sample selection process that prioritizes high-information unlabeled instances. Extensive evaluations on the proprietary MarketCalls corpus and the publicly available MIntRec 2.0 benchmark demonstrate DialogGraph-LLM's superiority over strong audio and text-driven baselines. The framework demonstrates strong performance and efficiency in intent recognition in real world scenario audio dialogues, proving its practical value for audio-rich domains with limited supervision.
Conversational Intent-Driven GraphRAG: Enhancing Multi-Turn Dialogue Systems through Adaptive Dual-Retrieval of Flow Patterns and Context Semantics
Zhu, Ziqi, Hu, Tao, Zhang, Honglong, Yang, Dan, Chen, HanGeng, Zhang, Mengran, Chen, Xilun
We present CID-GraphRAG (Conversational Intent-Driven Graph Retrieval Augmented Generation), a novel framework that addresses the limitations of existing dialogue systems in maintaining both contextual coherence and goal-oriented progression in multi-turn customer service conversations. Unlike traditional RAG systems that rely solely on semantic similarity (Conversation RAG) or standard knowledge graphs (GraphRAG), CID-GraphRAG constructs dynamic intent transition graphs from goal achieved historical dialogues and implements a dual-retrieval mechanism that adaptively balances intent-based graph traversal with semantic search. This approach enables the system to simultaneously leverage both conversional intent flow patterns and contextual semantics, significantly improving retrieval quality and response quality. In extensive experiments on real-world customer service dialogues, we employ both automatic metrics and LLM-as-judge assessments, demonstrating that CID-GraphRAG significantly outperforms both semantic-based Conversation RAG and intent-based GraphRAG baselines across all evaluation criteria. Quantitatively, CID-GraphRAG demonstrates substantial improvements over Conversation RAG across automatic metrics, with relative gains of 11% in BLEU, 5% in ROUGE-L, 6% in METEOR, and most notably, a 58% improvement in response quality according to LLM-as-judge evaluations. These results demonstrate that the integration of intent transition structures with semantic retrieval creates a synergistic effect that neither approach achieves independently, establishing CID-GraphRAG as an effective framework for addressing the challenges of maintaining contextual coherence and goal-oriented progression in knowledge-intensive multi-turn dialogues.
MVCL-DAF++: Enhancing Multimodal Intent Recognition via Prototype-Aware Contrastive Alignment and Coarse-to-Fine Dynamic Attention Fusion
Huang, Haofeng, Han, Yifei, Zhang, Long, Li, Bin, He, Yangfan
ABSTRACT Multimodal intent recognition (MMIR) suffers from weak semantic grounding and poor robustness under noisy or rare-class conditions. We propose MVCL-DAF++, which extends MVCL-DAF with two key modules: (1) Prototype-aware contrastive alignment, aligning instances to class-level prototypes to enhance semantic consistency; and (2) Coarse-to-fine attention fusion, integrating global modality summaries with token-level features for hierarchical cross-modal interaction. These results demonstrate the effectiveness of prototype-guided learning and coarse-to-fine fusion for robust multimodal understanding. Index T erms-- Multimodal intent recognition, Prototype-aware contrastive alignment, Coarse-to-fine dynamic attention fusion 1. INTRODUCTION Multimodal intent recognition (MMIR) [1] aims to infer user intentions by integrating heterogeneous signals such as spoken language, facial expressions, and vocal intonations. With the rapid adoption of human-centered AI systems [2], robust and generalizable multimodal understanding has become a cornerstone for building intelligent conversational agents [3, 4].
Tactile-Based Human Intent Recognition for Robot Assistive Navigation
Peng, Shaoting, Crowder, Dakarai, Yuan, Wenzhen, Driggs-Campbell, Katherine
Abstract-- Robot assistive navigation (RAN) is critical for enhancing the mobility and independence of the growing population of mobility-impaired individuals. However, existing systems often rely on interfaces that fail to replicate the intuitive and efficient physical communication observed between a person and a human caregiver, limiting their effectiveness. In this paper, we introduce T ac-Nav, a RAN system that leverages a cylindrical tactile skin mounted on a Stretch 3 mobile manipulator to provide a more natural and efficient interface for human navigational intent recognition. T o robustly classify the tactile data, we developed the Cylindrical Kernel Support V ector Machine (CK-SVM), an algorithm that explicitly models the sensor's cylindrical geometry and is consequently robust to the natural rotational shifts present in a user's grasp. Comprehensive experiments were conducted to demonstrate the effectiveness of our classification algorithm and the overall system. Results show that CK-SVM achieved superior classification accuracy on both simulated (97.1%) and real-world (90.8%) datasets compared to four baseline models. Furthermore, a pilot study confirmed that users more preferred the T ac-Nav tactile interface over conventional joystick and voice-based controls. I. INTRODUCTION Robot assistive navigation (RAN) - the task of a robot providing physical support while people moving from one place to another - is of critical importance for people with mobility impairments [1]. In the U.S., 12.2% of adults live with a mobility disability, and the aging population suggests an increasing need for navigation assistance [2], [3].
Evaluating Multi-Turn Bargain Skills in LLM-Based Seller Agent
Wang, Issue Yishu, Chong, Kakam, Wang, Xiaofeng, Yan, Xu, Kong, DeXin, Ju, Chen, Chen, Ming, Xiao, Shuai, Han, Shuguang, chen, jufeng
In online second-hand marketplaces, multi-turn bargaining is a crucial part of seller-buyer interactions. Large Language Models (LLMs) can act as seller agents, negotiating with buyers on behalf of sellers under given business constraints. A critical ability for such agents is to track and accurately interpret cumulative buyer intents across long negotiations, which directly impacts bargaining effectiveness. We introduce a multi-turn evaluation framework for measuring the bargaining ability of seller agents in e-commerce dialogues. The framework tests whether an agent can extract and track buyer intents. Our contributions are: (1) a large-scale e-commerce bargaining benchmark spanning 622 categories, 9,892 products, and 3,014 tasks; (2) a turn-level evaluation framework grounded in Theory of Mind (ToM) with annotated buyer intents, moving beyond outcome-only metrics; and (3) an automated pipeline that extracts reliable intent from massive dialogue data.
Enhancing Low-Altitude Airspace Security: MLLM-Enabled UAV Intent Recognition
Lei, Guangyu, Liang, Tianhao, Ping, Yuqi, Chen, Xinglin, Zhou, Longyu, Wu, Junwei, Zhang, Xiyuan, Ding, Huahao, Zhang, Xingjian, Yuan, Weijie, Zhang, Tingting, Zhang, Qinyu
The rapid development of the low-altitude economy emphasizes the critical need for effective perception and intent recognition of non-cooperative unmanned aerial vehicles (UAVs). The advanced generative reasoning capabilities of multimodal large language models (MLLMs) present a promising approach in such tasks. In this paper, we focus on the combination of UAV intent recognition and the MLLMs. Specifically, we first present an MLLM-enabled UAV intent recognition architecture, where the multimodal perception system is utilized to obtain real-time payload and motion information of UAVs, generating structured input information, and MLLM outputs intent recognition results by incorporating environmental information, prior knowledge, and tactical preferences. Subsequently, we review the related work and demonstrate their progress within the proposed architecture. Then, a use case for low-altitude confrontation is conducted to demonstrate the feasibility of our architecture and offer valuable insights for practical system design. Finally, the future challenges are discussed, followed by corresponding strategic recommendations for further applications.
Do Large Language Models Need Intent? Revisiting Response Generation Strategies for Service Assistant
Bolshinsky, Inbal, Kupiec, Shani, Sasson, Almog, Aperstein, Yehudit, Apartsin, Alexander
In the era of conversational AI, generating accurate and contextually appropriate service responses remains a critical challenge. A central question remains: Is explicit intent recognition a prerequisite for generating high-quality service responses, or can models bypass this step and produce effective replies directly? This paper conducts a rigorous comparative study to address this fundamental design dilemma. Leveraging two publicly available service interaction datasets, we benchmark several state-of-the-art language models, including a fine-tuned T5 variant, across both paradigms: Intent-First Response Generation and Direct Response Generation. Evaluation metrics encompass both linguistic quality and task success rates, revealing surprising insights into the necessity or redundancy of explicit intent modelling. Our findings challenge conventional assumptions in conversational AI pipelines, offering actionable guidelines for designing more efficient and effective response generation systems.
LLM-Guided Semantic Relational Reasoning for Multimodal Intent Recognition
Zhou, Qianrui, Xu, Hua, Wang, Yifan, Dong, Xinzhi, Zhang, Hanlei
Understanding human intents from multimodal signals is critical for analyzing human behaviors and enhancing human-machine interactions in real-world scenarios. However, existing methods exhibit limitations in their modality-level reliance, constraining relational reasoning over fine-grained semantics for complex intent understanding. This paper proposes a novel LLM-Guided Semantic Relational Reasoning (LGSRR) method, which harnesses the expansive knowledge of large language models (LLMs) to establish semantic foundations that boost smaller models' relational reasoning performance. Specifically, an LLM-based strategy is proposed to extract fine-grained semantics as guidance for subsequent reasoning, driven by a shallow-to-deep Chain-of-Thought (CoT) that autonomously uncovers, describes, and ranks semantic cues by their importance without relying on manually defined priors. Besides, we formally model three fundamental types of semantic relations grounded in logical principles and analyze their nuanced interplay to enable more effective relational reasoning. Extensive experiments on multimodal intent and dialogue act recognition tasks demonstrate LGSRR's superiority over state-of-the-art methods, with consistent performance gains across diverse semantic understanding scenarios. The complete data and code are available at https://github.com/thuiar/LGSRR.
Multilingual Datasets for Custom Input Extraction and Explanation Requests Parsing in Conversational XAI Systems
Wang, Qianli, Anikina, Tatiana, Feldhus, Nils, Ostermann, Simon, Splitt, Fedor, Li, Jiaao, Tsoneva, Yoana, Möller, Sebastian, Schmitt, Vera
Conversational explainable artificial intelligence (ConvXAI) systems based on large language models (LLMs) have garnered considerable attention for their ability to enhance user comprehension through dialogue-based explanations. Current ConvXAI systems often are based on intent recognition to accurately identify the user's desired intention and map it to an explainability method. While such methods offer great precision and reliability in discerning users' underlying intentions for English, a significant challenge in the scarcity of training data persists, which impedes multilingual generalization. Besides, the support for free-form custom inputs, which are user-defined data distinct from pre-configured dataset instances, remains largely limited. To bridge these gaps, we first introduce MultiCoXQL, a multilingual extension of the CoXQL dataset spanning five typologically diverse languages, including one low-resource language. Subsequently, we propose a new parsing approach aimed at enhancing multilingual parsing performance, and evaluate three LLMs on MultiCoXQL using various parsing strategies. Furthermore, we present Compass, a new multilingual dataset designed for custom input extraction in ConvXAI systems, encompassing 11 intents across the same five languages as MultiCoXQL. We conduct monolingual, cross-lingual, and multilingual evaluations on Compass, employing three LLMs of varying sizes alongside BERT-type models.